A New Big Data Feature Selection Approach for Text Classification

Abstract

Feature selection (FS) is a fundamental task for text classification problems. Text feature selection aims to represent documents using the most relevant features. This process can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on elaborating efficient FS techniques. However, the proposed approaches are evaluated on small datasets and validated on single machines. As textual data dimensionality becomes higher, traditional FS methods must be improved and parallelized to handle big data. This paper proposes a distributed approach based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of terms during feature selection. The proposal introduces a distributed FS method, namely, Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information techniques to improve the quality of the selected features. The approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several experiments using the multinomial Naïve Bayes classifier on three datasets. Through a series of tests, the results reveal that the proposed method improves classification results compared with four state-of-the-art methods in terms of macro-F1 and micro-F1 measures.
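To make the drawback mentioned in the abstract concrete, here is a minimal single-machine sketch of classic document-level MI scoring for term selection. It is not the MTF-MI method from the paper, whose formula the abstract does not give; the function names `mutual_information_scores` and `select_top_k` and the toy corpus are assumptions made purely for illustration. The score used is MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score each (term, class) pair with pointwise mutual information.

    docs: list of token lists; labels: one class label per document.
    Only the presence of a term in a document is counted, so how often
    the term occurs inside a document is ignored.
    """
    n_docs = len(docs)
    class_count = Counter(labels)        # N(c): documents in class c
    term_count = Counter()               # N(t): documents containing t
    joint_count = defaultdict(Counter)   # N(t, c): documents of c containing t
    for tokens, label in zip(docs, labels):
        for term in set(tokens):         # set() discards in-document frequency
            term_count[term] += 1
            joint_count[term][label] += 1

    scores = {}
    for term, per_class in joint_count.items():
        for label, n_tc in per_class.items():
            # MI(t, c) = log( P(t, c) / (P(t) * P(c)) )
            scores[(term, label)] = math.log(
                (n_tc * n_docs) / (term_count[term] * class_count[label])
            )
    return scores


def select_top_k(scores, k):
    """Keep the k terms with the highest MI over all classes (ties by name)."""
    best = {}
    for (term, _label), score in scores.items():
        best[term] = max(score, best.get(term, float("-inf")))
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Hypothetical toy corpus: "goal" appears twice in the first document,
# yet its score is the same as if it appeared once.
docs = [["goal", "match", "goal"], ["match", "vote"],
        ["vote", "law"], ["goal", "team"]]
labels = ["sport", "politics", "politics", "sport"]
scores = mutual_information_scores(docs, labels)
top_terms = select_top_k(scores, 2)
```

Because each document contributes a term at most once, repeating "goal" inside a document does not change its score; this is the term-frequency blindness the abstract refers to. In a MapReduce setting, the three counters would plausibly be built by mappers emitting per-document counts and reducers summing them, but that layout is likewise an assumption rather than the paper's design.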

Similar resources

Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim at eliminating noisy, redundant, or irrelevant features that may deteriorate the classification performance. However, traditional methods lack enough scalability to cope with datasets of millions of instances and extract successful results in a delimited time...

A New Approach to Credibility Premium for Zero-Inflated Poisson Models for Panel Data

The main aim of this research is to obtain and compare credibility premiums in unreported-count models for longitudinal data. Predictive premiums are computed under the squared-error and exponential loss functions and compared with each other. The desire to earn a bonus or reward is one of the main reasons accidents go unreported, and policyholders often refrain from reporting low-cost accidents in order to keep their discounts. In this research ...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and might even treat the minority class data as outliers. The text data is one of t...

A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that is harmful to communications around the world. Spam is a growing problem in personal email, so it is essential to detect it. Machine learning is very useful for solving this problem, as it shows good results in learning all the requisite patterns for classification due to its adaptive nature. Nonetheless, in spam detection, there are a large num...

Journal

Journal title: Scientific Programming

Year: 2021

ISSN: 1058-9244, 1875-919X

DOI: https://doi.org/10.1155/2021/6645345